Linear Regression with Scikit Learn - Machine Learning with Python

This tutorial is a part of Zero to Data Science Bootcamp by Jovian and Machine Learning with Python: Zero to GBMs

The following topics are covered in this tutorial:

  • A typical problem statement for machine learning
  • Downloading and exploring a dataset for machine learning
  • Linear regression with one variable using Scikit-learn
  • Linear regression with multiple variables
  • Using categorical features for machine learning
  • Regression coefficients and feature importance
  • Other models and techniques for regression using Scikit-learn
  • Applying linear regression to other datasets

Problem Statement

This tutorial takes a practical and coding-focused approach. We'll define the terms machine learning and linear regression in the context of a problem, and later generalize their definitions. We'll work through a typical machine learning problem step-by-step:

QUESTION: ACME Insurance Inc. offers affordable health insurance to thousands of customer all over the United States. As the lead data scientist at ACME, you're tasked with creating an automated system to estimate the annual medical expenditure for new customers, using information such as their age, sex, BMI, children, smoking habits and region of residence.

Estimates from your system will be used to determine the annual insurance premium (amount paid every month) offered to the customer. Due to regulatory requirements, you must be able to explain why your system outputs a certain prediction.

You're given a CSV file containing verified historical data, consisting of the aforementioned information and the actual medical charges incurred by over 1300 customers.

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets

EXERCISE: Before proceeding further, take a moment to think about how can approach this problem. List five or more ideas that come to your mind below:

  1. ???
  2. ???
  3. ???
  4. ???
  5. ???

Downloading the Data

To begin, let's download the data using the urlretrieve function from urllib.request.

In [1]:
# medical_charges_url = 'https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv'
In [2]:
# from urllib.request import urlretrieve
In [3]:
# urlretrieve(medical_charges_url, 'medical.csv')

We can now create a Pandas dataframe using the downloaded file, to view and analyze the data.

In [4]:
import pandas as pd
In [5]:
medical_df = pd.read_csv('medical.csv')
In [6]:
medical_df.head(3)
Out[6]:
age sex bmi children smoker region charges
0 19 female 27.90 0 yes southwest 16884.9240
1 18 male 33.77 1 no southeast 1725.5523
2 28 male 33.00 3 no southeast 4449.4620

The dataset contains 1338 rows and 7 columns. Each row of the dataset contains information about one customer.

Our objective is to find a way to estimate the value in the "charges" column using the values in the other columns. If we can do so for the historical data, then we should able to estimate charges for new customers too, simply by asking for information like their age, sex, BMI, no. of children, smoking habits and region.

Let's check the data type for each column.

In [7]:
medical_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
age         1338 non-null int64
sex         1338 non-null object
bmi         1338 non-null float64
children    1338 non-null int64
smoker      1338 non-null object
region      1338 non-null object
charges     1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB

Looks like "age", "children", "bmi" (body mass index) and "charges" are numbers, whereas "sex", "smoker" and "region" are strings (possibly categories). None of the columns contain any missing values, which saves us a fair bit of work!

Here are some statistics for the numerical columns:

In [8]:
medical_df.describe()
Out[8]:
age bmi children charges
count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.663397 1.094918 13270.422265
std 14.049960 6.098187 1.205493 12110.011237
min 18.000000 15.960000 0.000000 1121.873900
25% 27.000000 26.296250 0.000000 4740.287150
50% 39.000000 30.400000 1.000000 9382.033000
75% 51.000000 34.693750 2.000000 16639.912515
max 64.000000 53.130000 5.000000 63770.428010

The ranges of values in the numerical columns seem reasonable too (no negative ages!), so we may not have to do much data cleaning or correction. The "charges" column seems to be significantly skewed however, as the median (50 percentile) is much lower than the maximum value.

EXERCISE: What other inferences can you draw by looking at the table above? Add your inferences below:

  1. ???
  2. ???
  3. ???
  4. ???
  5. ???

Let's save our work before continuing.

Exploratory Analysis and Visualization

Let's explore the data by visualizing the distribution of values in some columns of the dataset, and the relationships between "charges" and other columns.

We'll use libraries Matplotlib, Seaborn and Plotly for visualization. Follow these tutorials to learn how to use these libraries:

In [9]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The following settings will improve the default style and font sizes for our charts.

In [10]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Age

Age is a numeric column. The minimum age in the dataset is 18 and the maximum age is 64. Thus, we can visualize the distribution of age using a histogram with 47 bins (one for each year) and a box plot. We'll use plotly to make the chart interactive, but you can create similar charts using Seaborn.

In [11]:
medical_df.age.describe()
Out[11]:
count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64
In [12]:
fig = px.histogram(medical_df, 
                   x='age', 
                   marginal='box', 
                   nbins=47, 
                   title='Distribution of Age')
fig.update_layout(bargap=0.1)
fig.show()

The distribution of ages in the dataset is almost uniform, with 20-30 customers at every age, except for the ages 18 and 19, which seem to have over twice as many customers as other ages. The uniform distribution might arise from the fact that there isn't a big variation in the number of people of any given age (between 18 & 64) in the USA.

EXERCISE: Can you explain why there are over twice as many customers with ages 18 and 19, compared to other ages?

???

Body Mass Index

Let's look at the distribution of BMI (Body Mass Index) of customers, using a histogram and box plot.

In [13]:
fig = px.histogram(medical_df, 
                   x='bmi', 
                   marginal='box', 
                   color_discrete_sequence=['red'], 
                   title='Distribution of BMI (Body Mass Index)')
fig.update_layout(bargap=0.1)
fig.show()

The measurements of body mass index seem to form a Gaussian distribution centered around the value 30, with a few outliers towards the right. Here's how BMI values can be interpreted (source):

EXERCISE: Can you explain why the distribution of ages forms a uniform distribution while the distribution of BMIs forms a gaussian distribution?

???

Charges

Let's visualize the distribution of "charges" i.e. the annual medical charges for customers. This is the column we're trying to predict. Let's also use the categorical column "smoker" to distinguish the charges for smokers and non-smokers.

In [14]:
fig = px.histogram(medical_df, 
                   x='charges', 
                   marginal='box', 
                   color='smoker', 
                   color_discrete_sequence=['green', 'grey'], 
                   title='Annual Medical Charges')
fig.update_layout(bargap=0.1)
fig.show()

We can make the following observations from the above graph:

  • For most customers, the annual medical charges are under \$10,000. Only a small fraction of customer have higher medical expenses, possibly due to accidents, major illnesses and genetic diseases. The distribution follows a "power law"
  • There is a significant difference in medical expenses between smokers and non-smokers. While the median for non-smokers is \$7300, the median for smokers is close to \\$35,000.

EXERCISE: Visualize the distribution of medical charges in connection with other factors like "sex" and "region". What do you observe?

In [ ]:
 
In [ ]:
 
In [ ]:
 

Smoker

Let's visualize the distribution of the "smoker" column (containing values "yes" and "no") using a histogram.

In [15]:
medical_df.smoker.value_counts()
Out[15]:
no     1064
yes     274
Name: smoker, dtype: int64
In [16]:
px.histogram(medical_df, x='smoker', color='sex', title='Smoker')

It appears that 20% of customers have reported that they smoke. Can you verify whether this matches the national average, assuming the data was collected in 2010? We can also see that smoking appears a more common habit among males. Can you verify this?

EXERCISE: Visualize the distributions of the "sex", "region" and "children" columns and report your observations.

In [ ]:
 
In [ ]:
 
In [ ]:
 

Having looked at individual columns, we can now visualize the relationship between "charges" (the value we wish to predict) and other columns.

Age and Charges

Let's visualize the relationship between "age" and "charges" using a scatter plot. Each point in the scatter plot represents one customer. We'll also use values in the "smoker" column to color the points.

In [17]:
fig = px.scatter(medical_df, 
                 x='age', 
                 y='charges', 
                 color='smoker', 
                 opacity=0.8, 
                 hover_data=['sex'], 
                 title='Age vs. Charges')
fig.update_traces(marker_size=5)
fig.show()

We can make the following observations from the above chart:

  • The general trend seems to be that medical charges increase with age, as we might expect. However, there is significant variation at every age, and it's clear that age alone cannot be used to accurately determine medical charges.
  • We can see three "clusters" of points, each of which seems to form a line with an increasing slope:

    1. The first and the largest cluster consists primary of presumably "healthy non-smokers" who have relatively low medical charges compared to others

    2. The second cluster contains a mix of smokers and non-smokers. It's possible that these are actually two distinct but overlapping clusters: "non-smokers with medical issues" and "smokers without major medical issues".

    3. The final cluster consists exclusively of smokers, presumably smokers with major medical issues that are possibly related to or worsened by smoking.

EXERCISE: What other inferences can you draw from the above chart?

???

BMI and Charges

Let's visualize the relationship between BMI (body mass index) and charges using another scatter plot. Once again, we'll use the values from the "smoker" column to color the points.

In [18]:
fig = px.scatter(medical_df, 
                 x='bmi', 
                 y='charges', 
                 color='smoker', 
                 opacity=0.8, 
                 hover_data=['sex'], 
                 title='BMI vs. Charges')
fig.update_traces(marker_size=5)
fig.show()

It appears that for non-smokers, an increase in BMI doesn't seem to be related to an increase in medical charges. However, medical charges seem to be significantly higher for smokers with a BMI greater than 30.

What other insights can you gather from the above graph?

EXERCISE: Create some more graphs to visualize how the "charges" column is related to other columns ("children", "sex", "region" and "smoker"). Summarize the insights gathered from these graphs.

Hint: Use violin plots (px.violin) and bar plots (sns.barplot)

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Correlation

As you can tell from the analysis, the values in some columns are more closely related to the values in "charges" compared to other columns. E.g. "age" and "charges" seem to grow together, whereas "bmi" and "charges" don't.

This relationship is often expressed numerically using a measure called the correlation coefficient, which can be computed using the .corr method of a Pandas series.

In [19]:
medical_df.charges.corr(medical_df.age)
Out[19]:
0.29900819333064765
In [20]:
medical_df.charges.corr(medical_df.bmi)
Out[20]:
0.19834096883362892

To compute the correlation for categorical columns, they must first be converted into numeric columns.

In [21]:
smoker_values = {'no': 0, 'yes': 1}
smoker_numeric = medical_df.smoker.map(smoker_values)
medical_df.charges.corr(smoker_numeric)
Out[21]:
0.7872514304984772

Here's how correlation coefficients can be interpreted (source):

  • Strength: The greater the absolute value of the correlation coefficient, the stronger the relationship.

    • The extreme values of -1 and 1 indicate a perfectly linear relationship where a change in one variable is accompanied by a perfectly consistent change in the other. For these relationships, all of the data points fall on a line. In practice, you won’t see either type of perfect relationship.

    • A coefficient of zero represents no linear relationship. As one variable increases, there is no tendency in the other variable to either increase or decrease.

    • When the value is in-between 0 and +1/-1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or 1, the strength of the relationship increases and the data points tend to fall closer to a line.

  • Direction: The sign of the correlation coefficient represents the direction of the relationship.

    • Positive coefficients indicate that when the value of one variable increases, the value of the other variable also tends to increase. Positive relationships produce an upward slope on a scatterplot.

    • Negative coefficients represent cases when the value of one variable increases, the value of the other variable tends to decrease. Negative relationships produce a downward slope.

Here's the same relationship expressed visually (source):

The correlation coefficient has the following formula:

You can learn more about the mathematical definition and geometric interpretation of correlation here: https://www.youtube.com/watch?v=xZ_z8KWkhXE

Pandas dataframes also provide a .corr method to compute the correlation coefficients between all pairs of numeric columns.

In [22]:
medical_df.corr()
Out[22]:
age bmi children charges
age 1.000000 0.109272 0.042469 0.299008
bmi 0.109272 1.000000 0.012759 0.198341
children 0.042469 0.012759 1.000000 0.067998
charges 0.299008 0.198341 0.067998 1.000000

The result of .corr is called a correlation matrix and is often visualized using a heatmap.

In [23]:
sns.heatmap(medical_df.corr(), cmap='Reds', annot=True)
plt.title('Correlation Matrix');

Correlation vs causation fallacy: Note that a high correlation cannot be used to interpret a cause-effect relationship between features. Two features $X$ and $Y$ can be correlated if $X$ causes $Y$ or if $Y$ causes $X$, or if both are caused independently by some other factor $Z$, and the correlation will no longer hold true if one of the cause-effect relationships is broken. It's also possible that $X$ are $Y$ simply appear to be correlated because the sample is too small.

While this may seem obvious, computers can't differentiate between correlation and causation, and decisions based on automated system can often have major consequences on society, so it's important to study why automated systems lead to a given result. Determining cause-effect relationships requires human insight.

Let's save our work before continuing.

Linear Regression using a Single Feature

We now know that the "smoker" and "age" columns have the strongest correlation with "charges". Let's try to find a way of estimating the value of "charges" using the value of "age" for non-smokers. First, let's create a data frame containing just the data for non-smokers.

In [24]:
non_smoker_df = medical_df[medical_df.smoker == 'no']

Next, let's visualize the relationship between "age" and "charges"

In [25]:
plt.title('Age vs. Charges')
sns.scatterplot(data=non_smoker_df, x='age', y='charges', alpha=0.7, s=15);

Apart from a few exceptions, the points seem to form a line. We'll try and "fit" a line using this points, and use the line to predict charges for a given age. A line on the X&Y coordinates has the following formula:

$y = wx + b$

The line is characterized two numbers: $w$ (called "slope") and $b$ (called "intercept").

Model

In the above case, the x axis shows "age" and the y axis shows "charges". Thus, we're assume the following relationship between the two:

$charges = w \times age + b$

We'll try determine $w$ and $b$ for the line that best fits the data.

  • This technique is called linear regression, and we call the above equation a linear regression model, because it models the relationship between "age" and "charges" as a straight line.

  • The numbers $w$ and $b$ are called the parameters or weights of the model.

  • The values in the "age" column of the dataset are called the inputs to the model and the values in the charges column are called "targets".

Let define a helper function estimate_charges, to compute $charges$, given $age$, $w$ and $b$.

In [26]:
def estimate_charges(age, w, b):
    return w * age + b

The estimate_charges function is our very first model.

Let's guess the values for $w$ and $b$ and use them to estimate the value for charges.

In [27]:
w = 50
b = 100
In [28]:
ages = non_smoker_df.age
estimated_charges = estimate_charges(ages, w, b)

We can plot the estimated charges using a line graph.

In [29]:
plt.plot(ages, estimated_charges, 'r-o');
plt.xlabel('Age');
plt.ylabel('Estimated Charges');

As expected, the points lie on a straight line.

We can overlay this line on the actual data, so see how well our model fits the data.

In [30]:
target = non_smoker_df.charges

plt.plot(ages, estimated_charges, 'r', alpha=0.9);
plt.scatter(ages, target, s=8,alpha=0.8);
plt.xlabel('Age');
plt.ylabel('Charges')
plt.legend(['Estimate', 'Actual']);

Clearly, the our estimates are quite poor and the line does not "fit" the data. However, we can try different values of $w$ and $b$ to move the line around. Let's define a helper function try_parameters which takes w and b as inputs and creates the above plot.

In [31]:
def try_parameters(w, b):
    ages = non_smoker_df.age
    target = non_smoker_df.charges
    
    estimated_charges = estimate_charges(ages, w, b)
    
    plt.plot(ages, estimated_charges, 'r', alpha=0.9);
    plt.scatter(ages, target, s=8,alpha=0.8);
    plt.xlabel('Age');
    plt.ylabel('Charges')
    plt.legend(['Estimate', 'Actual']);
In [32]:
try_parameters(60, 200)
In [33]:
try_parameters(400, 5000)

EXERCISE: Try various values of $w$ and $b$ to find a line that best fits the data. What is the effect of changing the value of $w$? What is the effect of changing $b$?

In [ ]:
 
In [ ]:
 
In [ ]:
 

As we change the values, of $w$ and $b$ manually, trying to move the line visually closer to the points, we are learning the approximate relationship between "age" and "charges".

Wouldn't it be nice if a computer could try several different values of w and b and learn the relationship between "age" and "charges"? To do this, we need to solve a couple of problems:

  1. We need a way to measure numerically how well the line fits the points.

  2. Once the "measure of fit" has been computed, we need a way to modify w and b to improve the the fit.

If we can solve the above problems, it should be possible for a computer to determine w and b for the best fit line, starting from a random guess.

Loss/Cost Function

We can compare our model's predictions with the actual targets using the following method:

  • Calculate the difference between the targets and predictions (the differenced is called the "residual")
  • Square all elements of the difference matrix to remove negative values.
  • Calculate the average of the elements in the resulting matrix.
  • Take the square root of the result

The result is a single number, known as the root mean squared error (RMSE). The above description can be stated mathematically as follows:

Geometrically, the residuals can be visualized as follows:

Let's define a function to compute the RMSE.

In [34]:
import numpy as np
In [35]:
def rmse(targets, predictions):
    return np.sqrt(np.mean(np.square(targets - predictions)))

Let's compute the RMSE for our model with a sample set of weights

In [36]:
w = 50
b = 100
In [37]:
try_parameters(w, b)
In [38]:
targets = non_smoker_df['charges']
predicted = estimate_charges(non_smoker_df.age, w, b)
In [39]:
rmse(targets, predicted)
Out[39]:
8461.949562575493

Here's how we can interpret the above number: On average, each element in the prediction differs from the actual target by \$8461.

The result is called the loss because it indicates how bad the model is at predicting the target variables. It represents information loss in the model: the lower the loss, the better the model.

Let's modify the try_parameters functions to also display the loss.

In [40]:
def try_parameters(w, b):
    ages = non_smoker_df.age
    target = non_smoker_df.charges
    predictions = estimate_charges(ages, w, b)
    
    plt.plot(ages, predictions, 'r', alpha=0.9);
    plt.scatter(ages, target, s=8,alpha=0.8);
    plt.xlabel('Age');
    plt.ylabel('Charges')
    plt.legend(['Prediction', 'Actual']);
    
    loss = rmse(target, predictions)
    print("RMSE Loss: ", loss)
In [41]:
try_parameters(50, 100)
RMSE Loss:  8461.949562575493

EXERCISE: Try different values of $w$ and $b$ to minimize the RMSE loss. What's the lowest value of loss you are able to achieve? Can you come with a general strategy for finding better values of $w$ and $b$ by trial and error?

In [ ]:
 
In [ ]:
 
In [ ]:
 

Optimizer

Next, we need a strategy to modify weights w and b to reduce the loss and improve the "fit" of the line to the data.

Both of these have the same objective: to minimize the loss, however, while ordinary least squares directly computes the best values for w and b using matrix operations, while gradient descent uses a iterative approach, starting with a random values of w and b and slowly improving them using derivatives.

Here's a visualization of how gradient descent works:

Doesn't it look similar to our own strategy of gradually moving the line closer to the points?

Linear Regression using Scikit-learn

In practice, you'll never need to implement either of the above methods yourself. You can use a library like scikit-learn to do this for you.

Let's use the LinearRegression class from scikit-learn to find the best fit line for "age" vs. "charges" using the ordinary least squares optimization technique.

In [42]:
from sklearn.linear_model import LinearRegression

First, we create a new model object.

In [43]:
model = LinearRegression()

Next, we can use the fit method of the model to find the best fit line for the inputs and targets.

In [44]:
help(model.fit)
Help on method fit in module sklearn.linear_model.base:

fit(X, y, sample_weight=None) method of sklearn.linear_model.base.LinearRegression instance
    Fit linear model.
    
    Parameters
    ----------
    X : array-like or sparse matrix, shape (n_samples, n_features)
        Training data
    
    y : array_like, shape (n_samples, n_targets)
        Target values. Will be cast to X's dtype if necessary
    
    sample_weight : numpy array of shape [n_samples]
        Individual weights for each sample
    
        .. versionadded:: 0.17
           parameter *sample_weight* support to LinearRegression.
    
    Returns
    -------
    self : returns an instance of self.

Not that the input X must be a 2-d array, so we'll need to pass a dataframe, instead of a single column.

In [45]:
inputs = non_smoker_df[['age']]
targets = non_smoker_df.charges
print('inputs.shape :', inputs.shape)
print('targes.shape :', targets.shape)
inputs.shape : (1064, 1)
targes.shape : (1064,)

Let's fit the model to the data.

In [46]:
model.fit(inputs, targets)
Out[46]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

We can now make predictions using the model. Let's try predicting the charges for the ages 23, 37 and 61

In [47]:
model.predict(np.array([[23], 
                        [37], 
                        [61]]))
Out[47]:
array([ 4055.30443855,  7796.78921819, 14210.76312614])

Do these values seem reasonable? Compare them with the scatter plot above.

Let compute the predictions for the entire set of inputs

In [48]:
predictions = model.predict(inputs)
In [49]:
predictions
Out[49]:
array([2719.0598744 , 5391.54900271, 6727.79356686, ..., 2719.0598744 ,
       2719.0598744 , 3520.80661289])

Let's compute the RMSE loss to evaluate the model.

In [50]:
rmse(targets, predictions)
Out[50]:
4662.505766636395

Seems like our prediction is off by $4000 on average, which is not too bad considering the fact that there are several outliers.

The parameters of the model are stored in the coef_ and intercept_ properties.

In [51]:
# w
model.coef_
Out[51]:
array([267.24891283])
In [52]:
# b
model.intercept_
Out[52]:
-2091.420556565079

Are these parameters close to your best guesses?

Let's visualize the line created by the above parameters.

In [53]:
try_parameters(model.coef_, model.intercept_)
RMSE Loss:  4662.505766636395

Indeed the line is quite close to the points. It is slightly above the cluster of points, because it's also trying to account for the outliers.

EXERCISE: Use the SGDRegressor class from scikit-learn to train a model using the stochastic gradient descent technique. Make predictions and compute the loss. Do you see any difference in the result?

In [ ]:
 
In [ ]:
 
In [ ]:
 

EXERCISE: Repeat the steps is this section to train a linear regression model to estimate medical charges for smokers. Visualize the targets and predictions, and compute the loss.

In [ ]:
 
In [ ]:
 
In [ ]:
 

Machine Learning

Congratulations, you've just trained your first machine learning model! Machine learning is simply the process of computing the best parameters to model the relationship between some feature and targets.

Every machine learning problem has three components:

  1. Model

  2. Cost Function

  3. Optimizer

We'll look at several examples of each of the above in future tutorials. Here's how the relationship between these three components can be visualized:

As we've seen above, it takes just a few lines of code to train a machine learning model using scikit-learn.

In [54]:
# Create inputs and targets
inputs, targets = non_smoker_df[['age']], non_smoker_df['charges']

# Create and train the model
model = LinearRegression().fit(inputs, targets)

# Generate predictions
predictions = model.predict(inputs)

# Compute loss to evalute the model
loss = rmse(targets, predictions)
print('Loss:', loss)
Loss: 4662.505766636395

Let's save our work before continuing.

Linear Regression using Multiple Features

So far, we've used on the "age" feature to estimate "charges". Adding another feature like "bmi" is fairly straightforward. We simply assume the following relationship:

$charges = w_1 \times age + w_2 \times bmi + b$

We need to change just one line of code to include the BMI.

In [55]:
# Create inputs and targets
inputs, targets = non_smoker_df[['age', 'bmi']], non_smoker_df['charges']

# Create and train the model
model = LinearRegression().fit(inputs, targets)

# Generate predictions
predictions = model.predict(inputs)

# Compute loss to evalute the model
loss = rmse(targets, predictions)
print('Loss:', loss)
Loss: 4662.3128354612945

As you can see, adding the BMI doesn't seem to reduce the loss by much, as the BMI has a very weak correlation with charges, especially for non smokers.

In [56]:
non_smoker_df.charges.corr(non_smoker_df.bmi)
Out[56]:
0.08403654312833268
In [57]:
fig = px.scatter(non_smoker_df, x='bmi', y='charges', title='BMI vs. Charges')
fig.update_traces(marker_size=5)
fig.show()

We can also visualize the relationship between all 3 variables "age", "bmi" and "charges" using a 3D scatter plot.

In [58]:
fig = px.scatter_3d(non_smoker_df, x='age', y='bmi', z='charges')
fig.update_traces(marker_size=3, marker_opacity=0.5)
fig.show()

You can see that it's harder to interpret a 3D scatter plot compared to a 2D scatter plot. As we add more features, it becomes impossible to visualize all feature at once, which is why we use measures like correlation and loss.

Let's also check the parameters of the model.

In [59]:
model.coef_, model.intercept_
Out[59]:
(array([266.87657817,   7.07547666]), -2293.6320906488654)

Clearly, BMI has a much lower weightage, and you can see why. It has a tiny contribution, and even that is probably accidental. This is an important thing to keep in mind: you can't find a relationship that doesn't exist, no matter what machine learning technique or optimization algorithm you apply.

EXERCISE: Train a linear regression model to estimate charges using BMI alone. Do you expect it to be better or worse than the previously trained models?

In [ ]:
 
In [ ]:
 
In [ ]:
 

Let's go one step further, and add the final numeric column: "children", which seems to have some correlation with "charges".

$charges = w_1 \times age + w_2 \times bmi + w_3 \times children + b$

In [60]:
non_smoker_df.charges.corr(non_smoker_df.children)
Out[60]:
0.13892870453542197
In [61]:
fig = px.strip(non_smoker_df, x='children', y='charges', title= "Children vs. Charges")
fig.update_traces(marker_size=4, marker_opacity=0.7)
fig.show()
In [62]:
# Create inputs and targets
inputs, targets = non_smoker_df[['age', 'bmi', 'children']], non_smoker_df['charges']

# Create and train the model
model = LinearRegression().fit(inputs, targets)

# Generate predictions
predictions = model.predict(inputs)

# Compute loss to evalute the model
loss = rmse(targets, predictions)
print('Loss:', loss)
Loss: 4608.470405038247

Once again, we don't see a big reduction in the loss, even though it's greater than in the case of BMI.

EXERCISE: Repeat the steps is this section to train a linear regression model to estimate medical charges for smokers. Visualize the targets and predictions, and compute the loss.

In [ ]:
 
In [ ]:
 

EXERCISE: Repeat the steps is this section to train a linear regression model to estimate medical charges for all customers. Visualize the targets and predictions, and compute the loss. Is the loss lower or higher?

In [63]:
# Create inputs and targets
inputs, targets = medical_df[['age', 'bmi', 'children']], medical_df['charges']

# Create and train the model
model = LinearRegression().fit(inputs, targets)

# Generate predictions
predictions = model.predict(inputs)

# Compute loss to evalute the model
loss = rmse(targets, predictions)
print('Loss:', loss)
Loss: 11355.317901125973
In [ ]:
 
In [ ]:
 

Let's save our work before continuing.

Using Categorical Features for Machine Learning

So far we've been using only numeric columns, since we can only perform computations with numbers. If we could use categorical columns like "smoker", we can train a single model for the entire dataset.

To use the categorical columns, we simply need to convert them to numbers. There are three common techniques for doing this:

  1. If a categorical column has just two categories (it's called a binary category), then we can replace their values with 0 and 1.
  2. If a categorical column has more than 2 categories, we can perform one-hot encoding i.e. create a new column for each category with 1s and 0s.
  3. If the categories have a natural order (e.g. cold, neutral, warm, hot), then they can be converted to numbers (e.g. 1, 2, 3, 4) preserving the order. These are called ordinals

Binary Categories

The "smoker" category has just two values "yes" and "no". Let's create a new column "smoker_code" containing 0 for "no" and 1 for "yes".

In [64]:
sns.barplot(data=medical_df, x='smoker', y='charges');
In [65]:
smoker_codes = {'no': 0, 'yes': 1}
medical_df['smoker_code'] = medical_df.smoker.map(smoker_codes)
In [66]:
medical_df.charges.corr(medical_df.smoker_code)
Out[66]:
0.7872514304984772
In [67]:
medical_df
Out[67]:
age sex bmi children smoker region charges smoker_code
0 19 female 27.900 0 yes southwest 16884.92400 1
1 18 male 33.770 1 no southeast 1725.55230 0
2 28 male 33.000 3 no southeast 4449.46200 0
3 33 male 22.705 0 no northwest 21984.47061 0
4 32 male 28.880 0 no northwest 3866.85520 0
5 31 female 25.740 0 no southeast 3756.62160 0
6 46 female 33.440 1 no southeast 8240.58960 0
7 37 female 27.740 3 no northwest 7281.50560 0
8 37 male 29.830 2 no northeast 6406.41070 0
9 60 female 25.840 0 no northwest 28923.13692 0
10 25 male 26.220 0 no northeast 2721.32080 0
11 62 female 26.290 0 yes southeast 27808.72510 1
12 23 male 34.400 0 no southwest 1826.84300 0
13 56 female 39.820 0 no southeast 11090.71780 0
14 27 male 42.130 0 yes southeast 39611.75770 1
15 19 male 24.600 1 no southwest 1837.23700 0
16 52 female 30.780 1 no northeast 10797.33620 0
17 23 male 23.845 0 no northeast 2395.17155 0
18 56 male 40.300 0 no southwest 10602.38500 0
19 30 male 35.300 0 yes southwest 36837.46700 1
20 60 female 36.005 0 no northeast 13228.84695 0
21 30 female 32.400 1 no southwest 4149.73600 0
22 18 male 34.100 0 no southeast 1137.01100 0
23 34 female 31.920 1 yes northeast 37701.87680 1
24 37 male 28.025 2 no northwest 6203.90175 0
25 59 female 27.720 3 no southeast 14001.13380 0
26 63 female 23.085 0 no northeast 14451.83515 0
27 55 female 32.775 2 no northwest 12268.63225 0
28 23 male 17.385 1 no northwest 2775.19215 0
29 31 male 36.300 2 yes southwest 38711.00000 1
... ... ... ... ... ... ... ... ...
1308 25 female 30.200 0 yes southwest 33900.65300 1
1309 41 male 32.200 2 no southwest 6875.96100 0
1310 42 male 26.315 1 no northwest 6940.90985 0
1311 33 female 26.695 0 no northwest 4571.41305 0
1312 34 male 42.900 1 no southwest 4536.25900 0
1313 19 female 34.700 2 yes southwest 36397.57600 1
1314 30 female 23.655 3 yes northwest 18765.87545 1
1315 18 male 28.310 1 no northeast 11272.33139 0
1316 19 female 20.600 0 no southwest 1731.67700 0
1317 18 male 53.130 0 no southeast 1163.46270 0
1318 35 male 39.710 4 no northeast 19496.71917 0
1319 39 female 26.315 2 no northwest 7201.70085 0
1320 31 male 31.065 3 no northwest 5425.02335 0
1321 62 male 26.695 0 yes northeast 28101.33305 1
1322 62 male 38.830 0 no southeast 12981.34570 0
1323 42 female 40.370 2 yes southeast 43896.37630 1
1324 31 male 25.935 1 no northwest 4239.89265 0
1325 61 male 33.535 0 no northeast 13143.33665 0
1326 42 female 32.870 0 no northeast 7050.02130 0
1327 51 male 30.030 1 no southeast 9377.90470 0
1328 23 female 24.225 2 no northeast 22395.74424 0
1329 52 male 38.600 2 no southwest 10325.20600 0
1330 57 female 25.740 2 no southeast 12629.16560 0
1331 23 female 33.400 0 no southwest 10795.93733 0
1332 52 female 44.700 3 no southwest 11411.68500 0
1333 50 male 30.970 3 no northwest 10600.54830 0
1334 18 female 31.920 0 no northeast 2205.98080 0
1335 18 female 36.850 0 no southeast 1629.83350 0
1336 21 female 25.800 0 no southwest 2007.94500 0
1337 61 female 29.070 0 yes northwest 29141.36030 1

1338 rows × 8 columns

We can now use the smoker_df column for linear regression.

$charges = w_1 \times age + w_2 \times bmi + w_3 \times children + w_4 \times smoker + b$

In [68]:
# Create inputs and targets
inputs, targets = medical_df[['age', 'bmi', 'children', 'smoker_code']], medical_df['charges']

# Create and train the model
model = LinearRegression().fit(inputs, targets)

# Generate predictions
predictions = model.predict(inputs)

# Compute loss to evalute the model
loss = rmse(targets, predictions)
print('Loss:', loss)
Loss: 6056.439217188081

The loss reduces from 11355 to 6056, almost by 50%! This is an important lesson: never ignore categorical data.

Let's try adding the "sex" column as well.

$charges = w_1 \times age + w_2 \times bmi + w_3 \times children + w_4 \times smoker + w_5 \times sex + b$

In [69]:
sns.barplot(data=medical_df, x='sex', y='charges')
Out[69]:
<AxesSubplot:xlabel='sex', ylabel='charges'>
In [70]:
sex_codes = {'female': 0, 'male': 1}
In [71]:
medical_df['sex_code'] = medical_df.sex.map(sex_codes)
In [72]:
medical_df.charges.corr(medical_df.sex_code)
Out[72]:
0.0572920622020254
In [73]:
# Create inputs and targets
inputs, targets = medical_df[['age', 'bmi', 'children', 'smoker_code', 'sex_code']], medical_df['charges']

# Create and train the model
model = LinearRegression().fit(inputs, targets)

# Generate predictions
predictions = model.predict(inputs)

# Compute loss to evalute the model
loss = rmse(targets, predictions)
print('Loss:', loss)
Loss: 6056.100708754546

As you might expect, this does have a significant impact on the loss.

One-hot Encoding

The "region" column contains 4 values, so we'll need to use hot encoding and create a new column for each region.

In [74]:
sns.barplot(data=medical_df, x='region', y='charges');
In [75]:
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit(medical_df[['region']])
enc.categories_
Out[75]:
[array(['northeast', 'northwest', 'southeast', 'southwest'], dtype=object)]
In [76]:
one_hot = enc.transform(medical_df[['region']]).toarray()
one_hot
Out[76]:
array([[0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.]])
In [77]:
medical_df[['northeast', 'northwest', 'southeast', 'southwest']] = one_hot
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-77-64d52aac2bad> in <module>
----> 1 medical_df[['northeast', 'northwest', 'southeast', 'southwest']] = one_hot

~\Anaconda3\envs\Winter\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3365             self._setitem_frame(key, value)
   3366         elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3367             self._setitem_array(key, value)
   3368         else:
   3369             # set column

~\Anaconda3\envs\Winter\lib\site-packages\pandas\core\frame.py in _setitem_array(self, key, value)
   3391                     self[k1] = value[k2]
   3392             else:
-> 3393                 indexer = self.loc._convert_to_indexer(key, axis=1)
   3394                 self._check_setitem_copy()
   3395                 self.loc._setitem_with_indexer((slice(None), indexer), value)

~\Anaconda3\envs\Winter\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)
   1352                 kwargs = {'raise_missing': True if is_setter else
   1353                           raise_missing}
-> 1354                 return self._get_listlike_indexer(obj, axis, **kwargs)[1]
   1355         else:
   1356             try:

~\Anaconda3\envs\Winter\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1159         self._validate_read_indexer(keyarr, indexer,
   1160                                     o._get_axis_number(axis),
-> 1161                                     raise_missing=raise_missing)
   1162         return keyarr, indexer
   1163 

~\Anaconda3\envs\Winter\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1244                 raise KeyError(
   1245                     u"None of [{key}] are in the [{axis}]".format(
-> 1246                         key=key, axis=self.obj._get_axis_name(axis)))
   1247 
   1248             # We (temporarily) allow for some missing keys with .loc, except in

KeyError: "None of [Index(['northeast', 'northwest', 'southeast', 'southwest'], dtype='object')] are in the [columns]"
In [ ]:
medical_df

Let's include the region columns into our linear regression model.

$charges = w_1 \times age + w_2 \times bmi + w_3 \times children + w_4 \times smoker + w_5 \times sex + w_6 \times region + b$

In [ ]:
# Create inputs and targets
input_cols = ['age', 'bmi', 'children', 'smoker_code', 'sex_code', 'northeast', 'northwest', 'southeast', 'southwest']
inputs, targets = medical_df[input_cols], medical_df['charges']

# Create and train the model
model = LinearRegression().fit(inputs, targets)

# Generate predictions
predictions = model.predict(inputs)

# Compute loss to evalute the model
loss = rmse(targets, predictions)
print('Loss:', loss)

Once again, this leads to a fairly small reduction in the loss.

EXERCISE: Are two separate linear regression models, one for smokers and one of non-smokers, better than a single linear regression model? Why or why not? Try it out and see if you can justify your answer with data.

In [ ]:
 
In [ ]:
 
In [ ]:
 

Let's save our work before continuing.

Model Improvements

Let's discuss and apply some more improvements to our model.

Feature Scaling

Recall that due to regulatory requirements, we also need to explain the rationale behind the predictions our model.

$charges = w_1 \times age + w_2 \times bmi + w_3 \times children + w_4 \times smoker + w_5 \times sex + w_6 \times region + b$

To compare the importance of each feature in the model, our first instinct might be to compare their weights.

In [ ]:
model.coef_
In [ ]:
model.intercept_
In [ ]:
weights_df = pd.DataFrame({
    'feature': np.append(input_cols, 1),
    'weight': np.append(model.coef_, model.intercept_)
})
weights_df

While it seems like BMI and the "northeast" have a higher weight than age, keep in mind that the range of values for BMI is limited (15 to 40) and the "northeast" column only takes the values 0 and 1.

Because different columns have different ranges, we run into two issues:

  1. We can't compare the weights of different column to identify which features are important
  2. A column with a larger range of inputs may disproportionately affect the loss and dominate the optimization process.

For this reason, it's common practice to scale (or standardize) the values in numeric column by subtracting the mean and dividing by the standard deviation.

We can apply scaling using the StandardScaler class from scikit-learn.

In [ ]:
medical_df
In [ ]:
from sklearn.preprocessing import StandardScaler
In [ ]:
numeric_cols = ['age', 'bmi', 'children'] 
scaler = StandardScaler()
scaler.fit(medical_df[numeric_cols])
In [ ]:
scaler.mean_
In [ ]:
scaler.var_

We can now scale data as follows:

In [ ]:
scaled_inputs = scaler.transform(medical_df[numeric_cols])
scaled_inputs

These can now we combined with the categorical data

In [ ]:
cat_cols = ['smoker_code', 'sex_code', 'northeast', 'northwest', 'southeast', 'southwest']
categorical_data = medical_df[cat_cols].values
In [ ]:
inputs = np.concatenate((scaled_inputs, categorical_data), axis=1)
targets = medical_df.charges

# Create and train the model
model = LinearRegression().fit(inputs, targets)

# Generate predictions
predictions = model.predict(inputs)

# Compute loss to evalute the model
loss = rmse(targets, predictions)
print('Loss:', loss)

We can now compare the weights in the formula:

$charges = w_1 \times age + w_2 \times bmi + w_3 \times children + w_4 \times smoker + w_5 \times sex + w_6 \times region + b$

In [ ]:
weights_df = pd.DataFrame({
    'feature': np.append(numeric_cols + cat_cols, 1),
    'weight': np.append(model.coef_, model.intercept_)
})
weights_df.sort_values('weight', ascending=False)

As you can see now, the most important feature are:

  1. Smoker
  2. Age
  3. BMI

Creating a Test Set

Models like the one we've created in this tutorial are designed to be used in the real world. It's common practice to set aside a small fraction of the data (e.g. 10%) just for testing and reporting the results of the model.

In [ ]:
from sklearn.model_selection import train_test_split
In [ ]:
inputs_train, inputs_test, targets_train, targets_test = train_test_split(inputs, targets, test_size=0.1)
In [ ]:
# Create and train the model
model = LinearRegression().fit(inputs_train, targets_train)

# Generate predictions
predictions_test = model.predict(inputs_test)

# Compute loss to evalute the model
loss = rmse(targets_test, predictions_test)
print('Test Loss:', loss)

Let's compare this with the training loss.

In [ ]:
# Generate predictions
predictions_train = model.predict(inputs_train)

# Compute loss to evalute the model
loss = rmse(targets_train, predictions_train)
print('Training Loss:', loss)

Can you explain why the training loss is lower than the test loss? We'll discuss this in a lot more detail in future tutorials.

How to Approach a Machine Learning Problem

Here's a strategy you can apply to approach any machine learning problem:

  1. Explore the data and find correlations between inputs and targets
  2. Pick the right model, loss functions and optimizer for the problem at hand
  3. Scale numeric variables and one-hot encode categorical data
  4. Set aside a test set (using a fraction of the training set)
  5. Train the model
  6. Make predictions on the test set and compute the loss

We'll apply this process to several problems in future tutorials.

Let's save our work before continuing.

Summary and Further Reading

We've covered the following topics in this tutorial:

  • A typical problem statement for machine learning
  • Downloading and exploring a dataset for machine learning
  • Linear regression with one variable using Scikit-learn
  • Linear regression with multiple variables
  • Using categorical features for machine learning
  • Regression coefficients and feature importance
  • Creating a training and test set for reporting results

Apply the techniques covered in this tutorial to the following datasets:

Check out the following links to learn more about linear regression:

Revision Questions

  1. Why do we have to perform EDA before fitting a model to the data?
  2. What is a parameter?
  3. What is correlation?
  4. What does negative correlation mean?
  5. How can you find correlation between variables in Python?
  6. What is causation? Explain difference between correlation and causation with an example.
  7. Define Linear Regression.
  8. What is univariate linear regression?
  9. What is multivariate linear regression?
  10. What are weights and bias?
  11. What are inputs and targets?
  12. What is loss/cost function?
  13. What is residual?
  14. What is RMSE value? When and why do we use it?
  15. What is an Optimizer? What are different types of optimizers? Explain each with an example.
  16. What library is available in Python to perform Linear Regression?
  17. What is sklearn.linear_model ?
  18. What does model.fit() do? What arguments must be given?
  19. What does model.predict() do? What arguments must be given?
  20. How do we calculate RMSE values?
  21. What is model.coef_?
  22. What is model.intercept_?
  23. What is SGDRegressor? How is it different from Linear Regression?
  24. Define Machine Learning. What are the main components in Machine Learning?
  25. How does loss value help in determining whether the model is good or not?
  26. What are continuous and categorical variables?
  27. How do we handle categorical variables in Machine Learning? What are the common techniques?
  28. What is feature scaling? How does it help in Machine Learning?
  29. How do we perform scaling in Python?
  30. What is sklearn.preprocessing?
  31. What is a Test set?
  32. How do we split data for model fitting (training and testing) in Python?
  33. How do you approach a Machine Learning problem?